[WIP] Support DeepSeek V4 flash on SM120 with Triton fallback #40929
bbbearxyz wants to merge 25 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces support for the DeepSeek V4 model architecture, featuring horizontally fused MLA kernels, specialized MoE gating with softplus_sqrt, and MTP draft model integration. The changes include new CUDA and Triton kernels for optimized attention, quantization, and normalization, along with updates to Docker configurations and external dependencies like DeepGEMM and FlashMLA. Technical feedback identifies critical issues regarding the initialization of E8M0 scales, insufficient hardware-capability guards for FP8 intrinsics in CUDA kernels (which require SM89+), and a potential tensor reshape error in the Triton fallback logic.
very nice.
Cherry-picked from vllm-project#40929 commit b2a9e98. Signed-off-by: bbbearxyz <mzj1996@mail.ustc.edu.cn> Signed-off-by: jasl <jasl9187@hotmail.com>
Please reference #38476 as well. Consideration for sm80/86/89 is also appreciated, since they can use TRITON as well.
Issue: #40928
This PR is based on #40760
Tested on 2 x RTX Pro 6000 (SM120)
Summary
Support Triton fallback ops for DeepSeek V4 flash when DeepGEMM or FlashMLA is not available.
This PR adds a generic Triton implementation path for the DeepSeek V4 branch, including fallback kernels for sparse MLA attention, decode sparse attention, FP8 einsum, sparse attention indexer logits, and MHC prenorm GEMM. The existing optimized DeepGEMM / FlashMLA paths are still preferred when available; the Triton path is only used as a fallback.
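The rough shape of that backend selection is sketched below. This is an illustrative sketch only: the helper and module names are assumptions, not the exact code paths added by this PR.

```python
# Illustrative sketch of "prefer optimized backend, else Triton fallback".
# Function and module names here are hypothetical, not the PR's actual code.
import importlib.util

import torch


def _has_module(name: str) -> bool:
    """Return True if an optional dependency can be imported."""
    return importlib.util.find_spec(name) is not None


def select_mla_backend() -> str:
    """Pick an MLA backend: optimized path if usable, else Triton fallback."""
    major, minor = torch.cuda.get_device_capability()
    sm = major * 10 + minor
    # FlashMLA / DeepGEMM do not cover SM120 today (assumption for this
    # sketch); on such devices, or when the packages are missing, fall back.
    if sm == 90 and _has_module("flash_mla"):
        return "flashmla"
    return "triton_fallback"
```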
Why
My approach to running DeepSeek V4 flash on SM120 is to provide a generic Triton implementation rather than hard-blocking execution when DeepGEMM or FlashMLA is unavailable.
I think this is a reasonable fit for the vLLM DeepSeek V4 branch: when FlashMLA or DeepGEMM does not support a device yet, vLLM should still have a portable implementation that lets users run the model. Triton gives us a more general compatibility layer across GPU architectures, including SM120 and future SM architectures.
The goal of this PR is not to replace the optimized kernels. DeepGEMM and FlashMLA should remain the preferred paths when they are supported. However, when they are unavailable, the Triton fallback gives users a working implementation, even if there is still room for performance optimization.
This also keeps the migration cost low. If DeepGEMM adds SM120 support in the future, vLLM can switch SM120 back to the DeepGEMM path with minimal changes, while still keeping Triton as a portable fallback for other unsupported architectures.
Change
This PR supports DeepSeek V4 flash on SM120 by adding a generic Triton fallback path for kernels that currently depend on DeepGEMM or FlashMLA.
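To make the "generic Triton fallback" idea concrete, here is a minimal sketch of a plain blocked Triton GEMM, assuming fp16 inputs. It is not one of the PR's actual kernels (those additionally handle FP8 scales, sparse indices, and masking), just the shape of a portable compute path that runs on any architecture Triton supports.

```python
# Minimal illustrative blocked GEMM in Triton; not the PR's fallback kernels.
import torch
import triton
import triton.language as tl


@triton.jit
def _matmul_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program computes one BLOCK_M x BLOCK_N tile of C = A @ B.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a_mask = (offs_m[:, None] < M) & ((k + offs_k)[None, :] < K)
        b_mask = ((k + offs_k)[:, None] < K) & (offs_n[None, :] < N)
        a = tl.load(a_ptr + offs_m[:, None] * stride_am
                    + (k + offs_k)[None, :] * stride_ak, mask=a_mask, other=0.0)
        b = tl.load(b_ptr + (k + offs_k)[:, None] * stride_bk
                    + offs_n[None, :] * stride_bn, mask=b_mask, other=0.0)
        acc += tl.dot(a, b)
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptr + offs_m[:, None] * stride_cm
             + offs_n[None, :] * stride_cn, acc.to(tl.float16), mask=c_mask)


def triton_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Portable fallback GEMM: no dependency on DeepGEMM or FlashMLA.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    _matmul_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
    )
    return c
```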
Main changes include:
Serving benchmark
random input len: 1024
random output len: 1024
num prompts: 32
max_model_len=8192
gpu_memory_utilization=0.9
TP=2, PP=1
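For reference, these settings correspond roughly to the engine configuration below, sketched with vLLM's offline API rather than the actual serving command; the model identifier is a placeholder, not the exact checkpoint used.

```python
# Sketch of the engine settings behind the benchmark above; the model id is a
# placeholder and the offline API stands in for the serving launch command.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # placeholder model id
    tensor_parallel_size=2,                 # TP=2, PP=1
    max_model_len=8192,
    gpu_memory_utilization=0.9,
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=1024))
```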